Using accelerators to speed up scientific and engineering codes: perspectives and problems
Accelerators are quickly emerging as the leading technology to further boost
computing performance; their main feature is a massively parallel on-chip architecture. NVIDIA
and AMD GPUs and the Intel Xeon-Phi are examples of accelerators available today. Accelerators
are power-efficient and deliver up to one order of magnitude more peak performance than
traditional CPUs. However, existing codes for traditional CPUs require substantial changes to
run efficiently on accelerators, including rewriting with specific programming languages.
In this contribution we present our experience in porting large codes to NVIDIA GPU and Intel
Xeon-Phi accelerators. Our reference application is a CFD code based on the Lattice
Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for
processor architectures with a large degree of parallelism. However, exploiting
a large fraction of the theoretically available performance is not easy.
We consider a state-of-the-art two-dimensional LB model based on 37 populations (a
D2Q37 model) that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the
equation of state of a perfect gas.
We describe in detail how we implement and optimize our LB code for Xeon-Phi and
GPUs, and then analyze performance on single- and multi-accelerator systems. We
finally compare results with those available on recent traditional multi-core CPUs.
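The parallelism mentioned in this abstract comes from the two-step structure of an LB update: a propagate step that moves populations between neighbouring sites, and a collide step that is purely local to each site. The following is a minimal sketch of that structure only. It assumes the much simpler, standard D2Q9 lattice with a BGK collision as a stand-in (not the D2Q37 thermal model of the paper), periodic boundaries, and plain Python lists; the function names and the relaxation time `tau` are illustrative assumptions.

```python
# Standard D2Q9 velocity set and weights (a stand-in for D2Q37).
C = ((0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1))
W = (4 / 9,) + (1 / 9,) * 4 + (1 / 36,) * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for density rho and velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def propagate(f):
    """Move each population one lattice step along its velocity (periodic box)."""
    ny, nx = len(f), len(f[0])
    new = [[[0.0] * 9 for _ in range(nx)] for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            for i, (cx, cy) in enumerate(C):
                new[(y + cy) % ny][(x + cx) % nx][i] = f[y][x][i]
    return new


def collide(f, tau):
    """BGK relaxation towards equilibrium; every site is updated independently."""
    for row in f:
        for pops in row:
            rho = sum(pops)
            ux = sum(p * c[0] for p, c in zip(pops, C)) / rho
            uy = sum(p * c[1] for p, c in zip(pops, C)) / rho
            feq = equilibrium(rho, ux, uy)
            for i in range(9):
                pops[i] += (feq[i] - pops[i]) / tau
```

Because `collide` touches only the populations of a single site, each site can in principle be assigned to its own thread or vector lane; `propagate` is the step dominated by memory traffic.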
Nature of the spin-glass phase at experimental length scales
We present a massive equilibrium simulation of the three-dimensional Ising
spin glass at low temperatures. The Janus special-purpose computer has allowed
us to equilibrate, using parallel tempering, L=32 lattices down to T=0.64 Tc.
We demonstrate the relevance of equilibrium finite-size simulations to
understand experimental non-equilibrium spin glasses in the thermodynamical
limit by establishing a time-length dictionary. We conclude that
non-equilibrium experiments performed on a time scale of one hour can be
matched with equilibrium results on L=110 lattices. A detailed investigation of
the probability distribution functions of the spin and link overlap, as well as
of their correlation functions, shows that Replica Symmetry Breaking is the
appropriate theoretical framework for the physically relevant length scales.
Besides, we improve over existing methodologies to ensure equilibration in
parallel tempering simulations.
Comment: 48 pages, 19 postscript figures, 9 tables. Version accepted for
publication in the Journal of Statistical Mechanics
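For readers unfamiliar with parallel tempering, the core of the method is a swap move between two replicas simulated at different temperatures. The sketch below shows the standard Metropolis acceptance probability for such a swap; it is a generic textbook form, not the Janus implementation, and the function and argument names are assumptions chosen for illustration.

```python
import math


def swap_probability(beta_a, beta_b, energy_a, energy_b):
    """Acceptance probability for exchanging the configurations of two replicas.

    Replica a sits at inverse temperature beta_a with energy energy_a,
    replica b at beta_b with energy energy_b. Detailed balance on the joint
    Boltzmann weight gives
        p = min(1, exp[(beta_a - beta_b) * (energy_a - energy_b)]).
    """
    exponent = (beta_a - beta_b) * (energy_a - energy_b)
    return 1.0 if exponent >= 0 else math.exp(exponent)
```

A swap is always accepted when the colder replica (larger beta) currently holds the higher-energy configuration, and is exponentially suppressed otherwise.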
Simulating spin systems on IANUS, an FPGA-based computer
We describe the hardwired implementation of algorithms for Monte Carlo
simulations of a large class of spin models. We have implemented these
algorithms as VHDL codes and we have mapped them onto a dedicated processor
based on a large FPGA device. The measured performance on one such processor is
comparable to O(100) carefully programmed high-end PCs: it turns out to be even
better for some selected spin models. We describe here codes that we are
currently executing on the IANUS massively parallel FPGA-based system.
Comment: 19 pages, 8 figures; submitted to Computer Physics Communications
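As a software illustration of the kind of algorithm this abstract refers to, the following is a minimal Metropolis sweep for a three-dimensional Edwards-Anderson model with couplings of ±1 and periodic boundaries. It is a plain Python sketch under those assumptions, not the VHDL design described in the paper; the data layout (`J[axis][i]` holds the bond from site `i` to its forward neighbour along `axis`) and all function names are hypothetical.

```python
import math
import random

AXES = ((1, 0, 0), (0, 1, 0), (0, 0, 1))


def site(x, y, z, L):
    """Flatten periodic coordinates into a single index."""
    return (x % L) + L * ((y % L) + L * (z % L))


def local_field(spins, J, x, y, z, L):
    """Sum of J_ij * s_j over the six nearest neighbours of (x, y, z)."""
    i = site(x, y, z, L)
    h = 0
    for axis, (dx, dy, dz) in enumerate(AXES):
        fwd = site(x + dx, y + dy, z + dz, L)
        bwd = site(x - dx, y - dy, z - dz, L)
        h += J[axis][i] * spins[fwd]    # bond stored at this site
        h += J[axis][bwd] * spins[bwd]  # bond stored at the backward neighbour
    return h


def energy(spins, J, L):
    """H = -sum over bonds of J_ij * s_i * s_j."""
    total = 0
    for z in range(L):
        for y in range(L):
            for x in range(L):
                i = site(x, y, z, L)
                for axis, (dx, dy, dz) in enumerate(AXES):
                    j = site(x + dx, y + dy, z + dz, L)
                    total -= J[axis][i] * spins[i] * spins[j]
    return total


def metropolis_sweep(spins, J, L, beta, rng):
    """Attempt one spin flip per site, visiting sites in lexicographic order."""
    for z in range(L):
        for y in range(L):
            for x in range(L):
                i = site(x, y, z, L)
                # Flipping s_i changes the energy by 2 * s_i * (local field).
                delta_e = 2 * spins[i] * local_field(spins, J, x, y, z, L)
                if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i] = -spins[i]
```

Each update needs only the six neighbouring spins, six couplings, and one random number, which is why many such updates can be laid out side by side in hardware.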
Janus II: a new generation application-driven computer for spin-system simulations
This paper describes the architecture, the development and the implementation
of Janus II, a new generation application-driven number cruncher optimized for
Monte Carlo simulations of spin systems (mainly spin glasses). This domain of
computational physics is a recognized grand challenge of high-performance
computing: the resources necessary to study in detail theoretical models that
can make contact with experimental data are by far beyond those available using
commodity computer systems. On the other hand, several specific features of the
associated algorithms suggest that unconventional computer architectures, which
can be implemented with available electronics technologies, may lead to order
of magnitude increases in performance, reducing to acceptable values on human
scales the time needed to carry out simulation campaigns that would take
centuries on commercially available machines. Janus II is one such machine,
recently developed and commissioned, that builds upon and improves on the
successful JANUS machine, which has been used for physics since 2008 and is
still in operation today. This paper describes in detail the motivations behind
the project, the computational requirements, the architecture and the
implementation of this new machine and compares its expected performances with
those of currently available commercial systems.
Comment: 28 pages, 6 figures
Temperature chaos is present in off-equilibrium spin-glass dynamics
Experiments featuring non-equilibrium glassy dynamics under temperature changes still await interpretation. There is a widespread feeling that temperature chaos (an extreme sensitivity of the glass to temperature changes) should play a major role but, up to now, this phenomenon has been investigated solely under equilibrium conditions. In fact, the very existence of a chaotic effect in the non-equilibrium dynamics is yet to be established. In this article, we tackle this problem through a large simulation of the 3D Edwards-Anderson model, carried out on the Janus II supercomputer. We find a dynamic effect that closely parallels equilibrium temperature chaos. This dynamic temperature-chaos effect is spatially heterogeneous to a large degree and turns out to be controlled by the spin-glass coherence length ξ. Indeed, an emerging length-scale ξ* rules the crossover from weak (at ξ ≪ ξ*) to strong chaos (ξ ≫ ξ*). Extrapolations of ξ* to relevant experimental conditions are provided. © 2021, The Author(s)
Early Experience on Porting and Running a Lattice Boltzmann Code on the Xeon-Phi Co-Processor
In this paper we report on our early experience in porting, optimizing and benchmarking a Lattice Boltzmann (LB) code on the Xeon-Phi co-processor, the first generally available version of the new Many Integrated Core (MIC) architecture developed by Intel. We consider as a test-bed a state-of-the-art LB model that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas. The regular structure of LB algorithms makes it relatively easy to identify a large degree of available parallelism. However, mapping a large fraction of this parallelism onto this new class of processors is not straightforward. The D2Q37 LB algorithm considered in this paper is an appropriate test-bed for this architecture, since the critical computing kernels require high performance both in memory bandwidth for sparse memory access patterns and in number-crunching capability. We describe our implementation of the code, which builds on previous experience with other (simpler) many-core processors and GPUs, present benchmark results and measure performance, and finally compare with the results obtained by previous implementations developed on state-of-the-art classic multi-core CPUs and GP-GPUs.
Implementation and Optimization of a Thermal Lattice Boltzmann Algorithm on a multi-GPU cluster
Lattice Boltzmann (LB) methods are widely used today to describe the dynamics of fluids. Key advantages of this approach are the relative ease with which complex physics behavior, e.g. associated with multi-phase flows or irregular boundary conditions, can be modeled, and, from a computational perspective, the large degree of available parallelism that can be easily exploited on massively parallel systems. The advent of multi-core and many-core processors, including General Purpose Graphics Processing Units (GP-GPUs), has pushed the quest for parallelization also at the intra-processor level. From this point of view, LB methods may strongly benefit from these new architectures. In this paper we describe the implementation and optimization of a recently proposed thermal LB model, the so-called D2Q37 model, on multi-GPU systems. We describe in detail the optimization techniques that we have used at both the intra-processor and inter-processor level, present performance and scaling figures, and analyze bottlenecks associated with this implementation.
A multi-GPU implementation of a D2Q37 lattice Boltzmann code
We describe a parallel implementation of a compressible Lattice Boltzmann code on a multi-GPU cluster based on Nvidia Fermi processors. We analyze how to optimize the algorithm for GP-GPU architectures, describe the implementation choices that we have adopted and compare our performance results with an implementation optimized for latest-generation multi-core CPUs. Our program runs at ~30% of the double-precision peak performance of one GPU and shows almost linear scaling when run on the multi-GPU cluster. Keywords: Computational fluid-dynamics – Lattice Boltzmann methods – GP-GPU computing